Abstract. The dictionary matching problem preprocesses a set of patterns and
finds all occurrences of each of the patterns in a text when it is provided. We focus
on the dynamic setting, in which patterns can be inserted into and removed from
the dictionary without reprocessing the entire dictionary. This article presents the
first algorithm that performs dynamic dictionary matching on two-dimensional
data within small space. The time complexity of our algorithm is almost linear.
The only slowdown is incurred by querying the compressed self-index that replaces
the dictionary. The dictionary is updated in time proportional to the size of
the pattern that is being inserted into or removed from the dictionary. Our algorithm
is suitable for rectangular patterns that are of uniform size in one dimension.

1 Introduction 

In the dictionary matching problem, the task is to identify a set of patterns, called a dic- 
tionary, within a given text. Applications for this problem include searching the World- 
Wide Web for specific keywords, scanning a file for virus signatures, and network in- 
trusion detection. The problem also has applications in the biological sciences, such 
as searching through a DNA sequence for a set of motifs. Dictionary matching gen- 
eralizes to the two-dimensional setting. Image identification software, which identifies 
smaller images in a large image based on a set of known images, is a direct application 
of dictionary matching on two-dimensional data. 

In recent years, there has been a massive proliferation of digital data. Some of 
the main contributors to this data explosion are the World-Wide Web, next genera- 
tion sequencing, and increased use of satellite imaging. Concurrently, industry has been 
producing equipment with ever-decreasing hardware availability. Thus, researchers are 
faced with scenarios in which this data growth must be accessible to applications run- 
ning on devices that have reduced storage capacity, such as mobile and satellite devices. 
Hardware resources are more limited, yet the data sets continue to escalate in size. The 
added constraint of performing efficient dictionary matching using little or no extra 
space is a challenging and practical problem. 

It is often the case that the dictionary of patterns will change over time. Efficient
dynamic dictionary matching algorithms support insertion of a new pattern to the
dictionary and removal of a pattern from the dictionary, e.g., [2, 3, 16, 4, 25, 7, 15]. They
thereby eliminate the need to reprocess the entire dictionary and can adapt to changes
as they occur.

Idury and Schaffer developed a dynamic dictionary matching algorithm for rectangular
patterns of different sizes. Their algorithm uses working space proportional to the
input size and requires more than linear running time. The existing dynamic 2D
dictionary matching algorithms for square patterns use linear working space and incorporate
$O(\log \ell)$ or $O(\log^2 \ell)$ slowdown in processing the text and in updating the dictionary
[4, 8, 12].

The objective of this paper is to develop the first dynamic dictionary matching
algorithm for two-dimensional data in the space-constrained environment. The existing
static succinct 2D dictionary matching algorithm with no slowdown [22] is not suitable
for the dynamic setting. It relies on the succinct 1D dictionary matching algorithm
of Hon et al. [14], which does not readily admit changes to the dictionary. In this
paper, we extend the succinct 2D dictionary matching algorithm of [21], along with the
improvements of [20], to the dynamic setting. We develop a dynamic algorithm that
meets the time and space complexities that were achieved in the static version of the
algorithm. The dictionary is initially processed in time proportional to the size of the
dictionary. Subsequently, a pattern is inserted or removed in time proportional to the
single pattern's size. We modify the witness tree [21] to form a dynamic data structure
that meets the space and time complexities achieved by the static version. The dynamic
witness tree accommodates insertion or removal of any string in time proportional to
the string's length.

We define Two-Dimensional Dynamic Dictionary Matching (2D-DDM) as follows.

Initial Input: A dictionary of $d$ patterns, $D = \{P_1, \ldots, P_d\}$, and a text $T$ of size
$n_1 \times n_2$. Each $P_i$ is of size $m_i \times \overline{m}$, with total size $|D| = \ell$.

Update Dictionary: Insert or remove a given pattern $P$, of size $p \times \overline{m}$.

Process Text: Find all occurrences of $P_i$, $1 \le i \le d$, in $T$.

In this paper we present a time and space efficient algorithm to solve 2D-DDM.
During the preprocessing stage, our algorithm replaces the dictionary $D$ with a
compressed self-index. We use $\tau$ to denote the time it takes to access a character or perform
other queries in the compressed self-index of the dictionary; using recent results, $\tau$ is
at most $\log^2 \ell$. The initial preprocessing of the dictionary completes in $O(\ell\tau)$ time. A
pattern $P$, of size $p \times \overline{m}$, is inserted to or removed from the dictionary in $O(p\overline{m}\tau)$
time. Our algorithm searches the text $T$ in $O(n_1 n_2 \tau)$ time. The extra space used by our
algorithm is $O(d\overline{m} \log d\overline{m} + dm' \log dm')$ bits, where $m' = \max\{m_1, \ldots, m_d\}$.

The succinct 2D (static) dictionary matching algorithm of [21] was presented in
terms of a dictionary in which all patterns are the same size in both dimensions,
resulting in a dictionary of size $dm^2$. In this paper, we generalize this result to deal with a
dictionary of patterns that are of uniform width, but of varying heights. We perform a
detailed analysis and distinguish between the sources of time complexities. Specifically,
we analyze which time complexities are proportional to the uniform width of the
patterns, $\overline{m}$, which are proportional to the height¹ of the largest pattern, $m'$, and which are
proportional to the actual dictionary size, $\ell$. While doing this, we discovered the need
for more efficient techniques in the verification process in order for the text scanning to
remain linear in the size of the text, in the case of varying pattern heights. Herein lies
one of the contributions of this paper.

We begin by presenting related work on 1D dynamic dictionary matching in
Section 2. Then, in the following section, we present a linear-time dynamic 2D dictionary
matching algorithm that uses extra space proportional to the size of the input. In Section
4, we describe a succinct variation of this linear space algorithm for a dictionary with
a large number of patterns. For dictionaries in which the number of patterns is small
relative to their size, we describe our approach in Section 5. We conclude with open
problems in Section 6.



2 1D Dynamic Dictionary Matching

In this section we summarize the existing algorithms for 1D dynamic dictionary
matching since we build our two-dimensional algorithm on one-dimensional algorithms. The
1D dictionary consists of $d$ one-dimensional patterns of total size $\ell'$, drawn from an
alphabet of size $\sigma$.

The first dynamic dictionary matching algorithms use suffix trees and incur an
$O(\log \ell')$ slowdown in runtime [2, 3]. Idury and Schaffer [16] developed a dynamic
version of the classic Aho-Corasick automaton [1] in which the dictionary is
preprocessed in linear time. However, the tasks of updating the dictionary and scanning text
require extra time. The culmination of work by separate groups of researchers on
dynamic dictionary matching [2, 3, 16] is an algorithm that mimics the Aho-Corasick
automaton but stores the goto and report transitions separately [4]. The time
complexity of this algorithm is close to linear, albeit with an $O\left(\frac{\log \ell'}{\log\log \ell'}\right)$ slowdown to update
the dictionary or to scan text.

Sahinalp and Vishkin achieved dynamic dictionary matching with no slowdown
[25]. The preprocessing time of their algorithm is linear in the size of the dictionary,
text scanning is linear in the size of the text, and the dictionary is updated in time
proportional to the size of the pattern being added or removed. The time complexity of this
algorithm meets the standard set by Aho and Corasick for static dictionary matching.

Sahinalp and Vishkin's algorithm relies on compact tries and an original data
structure called a fat tree. Their algorithm employs a naming technique and identifies cores
of each pattern using a compact representation of the fat tree. If a pattern matches a
substring of the text, then the main core of the pattern and the text substring will
necessarily be aligned. Conversely, if the main cores do not match, the text is easily filtered
to a limited number of positions at which a pattern can occur. Dictionary patterns are
classified into groups according to the level of their main core. Then, an independent
data structure is built for each group, which consists of two compact tries. In total, this
algorithm uses working space proportional to the size of the input.

¹ We chose this notation since it is visual. The bar represents a uniform width, while the prime
is vertical, representing a uniform height.

For dynamic dictionary matching in the space-constrained application, Chan et al.
use the compressed suffix tree for succinct dictionary matching [7]. They build on the
ideas of Amir and Farach [2] to use the suffix tree for dictionary matching. They replace
the suffix tree with a compressed suffix tree developed by Sadakane [24], which is stored
in $O(\ell')$ bits, and show how to make the data structure dynamic. They describe how to
answer lowest marked ancestor queries by a balanced parenthesis representation of the
nodes. The time complexity of inserting and removing a pattern and of scanning text
has a slowdown of $O(\log^2 \ell')$.

An improved succinct dynamic dictionary matching algorithm was developed by
Hon et al. [15]. It uses space that meets $k$th order empirical entropy bounds of the
dictionary, $\ell' H_k(D) + o(\ell' \log \sigma) + O(d \log \ell')$ bits of space. The suffix tree is sampled
to save space and an innovative method is proposed for a lowest marked ancestor data
structure. They introduce the combination of a dynamic interval tree with a Dietz and
Sleator order-maintenance data structure as a framework for answering lowest marked
ancestor queries efficiently. Inserting or removing a dictionary pattern $P$, of length $p$,
requires $O(p \log \sigma + \log \ell')$ time and searching a text of length $n$ requires
$O(n \log \ell' + occ)$ time.

$H_k(S)$, i.e., the $k$th order empirical entropy of a string $S$, describes the minimum
number of bits that are needed to encode each symbol of the string within context, and
it is often used to demonstrate that storage space meets the information-theoretic lower
bounds of data.
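
As an illustration of this quantity, the following minimal sketch computes $H_k(S)$ under the usual textbook definition, $H_k(S) = \frac{1}{|S|} \sum_{w} |S_w| H_0(S_w)$, where $S_w$ collects the symbols that follow each length-$k$ context $w$ in $S$. The formula is not spelled out in the text above and is assumed here; the function names `h0` and `hk` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def h0(counts):
    """Zeroth-order empirical entropy (bits per symbol) of a symbol multiset."""
    n = sum(counts.values())
    return sum(c / n * log2(n / c) for c in counts.values()) if n else 0.0


def hk(s, k):
    """k-th order empirical entropy of s, in bits per symbol (assumed definition)."""
    if not s:
        return 0.0
    # Group every symbol by the k symbols that precede it (its context).
    followers = defaultdict(Counter)
    for i in range(k, len(s)):
        followers[s[i - k:i]][s[i]] += 1
    # Weight each context's zeroth-order entropy by the number of symbols it covers.
    total = sum(sum(c.values()) * h0(c) for c in followers.values())
    return total / len(s)
```

For a string such as "abababab", one bit per symbol is needed without context, while a single symbol of context determines the next symbol completely.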



3 2D-DDM in Linear Space

In this section we present a linear-time dynamic 2D dictionary matching algorithm that
uses extra space proportional to the size of the input. The first linear-time 2D single
pattern matching algorithm was developed independently by Bird [6] and by Baker [5].
They translate the 2D pattern matching problem into a 1D pattern matching problem.
Rows of the pattern are perceived as metacharacters and named so that distinct rows
receive different names. The text is named in a similar fashion and 1D pattern matching
is performed over the text columns and the pattern of names.

The Bird / Baker algorithm readily extends to dictionary matching by replacing the
1D single pattern matching mechanism, a Knuth-Morris-Pratt automaton, with 1D
dictionary matching, an Aho-Corasick automaton. In the multiple pattern matching version
of the Bird / Baker algorithm, 1D dictionary matching is used in two different ways.
First, the pattern rows are seen as a 1D dictionary and this set of "patterns" is used
to linearize the dictionary and then to label text positions. A separate 1D dictionary is
formed of the linearized 2D patterns. The Bird / Baker algorithm is suitable for 2D
patterns that are of uniform size in at least one dimension, so that the text can be marked
with at most one name at each text location. The Bird / Baker method uses linear time
and space in both the pattern preprocessing and the text scanning stages.



Sahinalp and Vishkin's [25] dynamic 1D dictionary matching algorithm (SV) uses
a naming technique rather than a dictionary-matching automaton. Yet, it is a suitable
replacement for the Aho-Corasick automata in the Bird / Baker algorithm. Thus, the
combination of these techniques, one for dynamic dictionary matching in 1D and the
other for static 2D dictionary matching, yields a dynamic 2D dictionary matching
algorithm that runs in linear time. This modification extends the Bird / Baker algorithm
to accommodate a changing dictionary, yet it does not introduce any slowdown. We
outline this process in Algorithm 1.



Algorithm 1 Dynamic Version of Bird / Baker Algorithm

{1} Preprocess Pattern:
    a) Name pattern rows using SV [25].
    b) Store 1D pattern of names for each pattern in D, called D'.
    c) Preprocess D' using SV to later perform 1D dynamic dictionary matching.
{2} Row Matching:
    Use SV on each row of text to find occurrences of D's pattern rows.
    This labels positions at which a pattern row ends.
{3} Column Matching:
    Run SV on named columns of text to find occurrences of patterns from D' in the text.
    Output pattern occurrences.



Initially, the dictionary of pattern rows is empty. One 2D pattern is linearized at a
time, row by row. As a pattern row is examined, it can be viewed as a text on which to
perform dictionary matching. If an existing pattern row is identified in the new pattern row, then
the new row is given the same name as the matching row. Otherwise, this new row is seen as a
new 1D pattern and added to the dictionary of pattern rows. Once the pattern rows have
been given names, the 1D patterns of names, D', are preprocessed separately.

Whenever a pattern is added to or removed from the 2D dictionary, the precomputed
information about the patterns can be adjusted in time proportional to the size of the
2D pattern that is entering or leaving the dictionary. That is, Sahinalp and Vishkin's
framework for dictionary matching allows both 1D dictionaries to efficiently react to a
change in the 2D linearized dictionary that they represent.
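
The flow of Algorithm 1, including updates, can be illustrated with a small sketch. It is not the SV machinery: Python hash tables stand in for both 1D dictionaries, so it shows the naming and row/column matching structure but does not achieve the linear time bounds discussed here. The class name `Naive2DDictionary` and its methods are hypothetical.

```python
class Naive2DDictionary:
    """Bird / Baker style matcher; dicts stand in for SV (an assumption)."""

    def __init__(self, width):
        self.width = width
        self.row_names = {}   # pattern row -> name (dictionary of pattern rows)
        self.row_refs = {}    # name -> number of pattern rows bearing it
        self.patterns = {}    # tuple of row names -> pattern id (this is D')
        self.next_name = 0

    def insert(self, pid, rows):
        names = []
        for row in rows:
            assert len(row) == self.width
            name = self.row_names.get(row)
            if name is None:  # unseen row: add it with a new name
                name = self.row_names[row] = self.next_name
                self.next_name += 1
            self.row_refs[name] = self.row_refs.get(name, 0) + 1
            names.append(name)
        self.patterns[tuple(names)] = pid

    def remove(self, rows):
        names = tuple(self.row_names[row] for row in rows)
        del self.patterns[names]
        for row, name in zip(rows, names):
            self.row_refs[name] -= 1
            if self.row_refs[name] == 0:  # no pattern uses this row any more
                del self.row_refs[name]
                del self.row_names[row]

    def scan(self, text):
        w = self.width
        # Step 2: label each position at which a pattern row ends.
        named = [[self.row_names.get(line[c - w + 1:c + 1]) if c >= w - 1 else None
                  for c in range(len(line))] for line in text]
        # Step 3: match the columns of names against D'.
        heights = {len(p) for p in self.patterns}
        hits = []
        for c in range(w - 1, len(text[0]) if text else 0):
            column = [row[c] for row in named]
            for r in range(len(column)):
                for h in heights:
                    if r + h <= len(column):
                        pid = self.patterns.get(tuple(column[r:r + h]))
                        if pid is not None:
                            hits.append((pid, r, c - w + 1))  # top-left corner
        return hits
```

Each hit is reported as (pattern id, top row, left column). Removing a pattern only touches the names of its own rows, which mirrors the claim that updates depend on the size of the pattern entering or leaving the dictionary.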

Space complexity of Algorithm 1: The dynamic version we present of the Bird /
Baker algorithm uses extra space proportional to the size of the input. It uses $O(\ell \log \ell)$
bits of extra space to name the pattern rows using SV [25] and $O(dm' \log dm')$ bits
of extra space to store and index the 1D representation of the patterns. During text
scanning, $O(n_2 \log n_2)$ bits of space are used to run SV on each row of text and
$O(n_1 \log n_1)$ bits of space are used to run SV on the named columns of text, one at
a time. $O(n_1 n_2 \log dm')$ bits of extra space are used to store the names given to text
positions.



4 2D-DDM in Small-Space For Large Number of Patterns 



The dynamic version of the Bird / Baker algorithm presented in Section 3 uses linear
working space. In this section we present a variation of Algorithm 1 that runs in small
space for a dictionary in which $d > \overline{m}$, that is, when the number of patterns is larger
than the width of a pattern.

We begin by modifying Algorithm 1 to work with small blocks of text and thereby
relate the extra space to the size of the dictionary, not the size of the text. We use a
known technique for minimizing space and process the text in small overlapping blocks
of size $3m'/2 \times 3\overline{m}/2$. Since each text block is processed in time proportional to the
size of the text block, the overall text scanning time remains linear.
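
One way to lay out such overlapping blocks is sketched below. The exact offsets are an assumption: blocks start every half pattern dimension and extend one further pattern dimension, so that any occurrence whose top-left corner falls in the first half-step of a block lies entirely inside that block. The function name `text_blocks` is hypothetical.

```python
def text_blocks(n1, n2, max_height, width):
    """Yield half-open block bounds (top, left, bottom, right) covering the text.

    Assumed layout: a block starts every max_height // 2 rows and width // 2
    columns, and spans roughly 3/2 of each pattern dimension.
    """
    step_r = max(1, max_height // 2)
    step_c = max(1, width // 2)
    for top in range(0, n1, step_r):
        for left in range(0, n2, step_c):
            yield (top, left,
                   min(n1, top + step_r + max_height),
                   min(n2, left + step_c + width))
```

Under this layout every text position belongs to a constant number of blocks, so processing each block in time proportional to its size keeps the total scanning time linear in the text size.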

By processing one text block at a time, we reduce the working space to
$O(\ell \log \ell + dm' \log dm')$ bits of extra space to preprocess the patterns and
$O(\overline{m} \log \overline{m} + \overline{m} m' \log dm')$ bits of extra space to search the text. This change does not
affect the time complexity. We seek to further reduce the working space by employing a smaller
space mechanism to name the pattern rows and subsequently name the text positions.

Recent innovations in succinct full-text indexing provide us with the ability to com- 
press a suffix tree, using space that is proportional to the entropy of the original data it 
is built upon. These self-indexes can replace the original text, as they support retrieval 
of the original text, in addition to answering queries about the data, very quickly. 

Several dynamic compressed suffix tree representations have been developed, each
offering a different time/space trade-off. Chan et al. presented a dynamic suffix tree
that occupies $O(\ell)$ bits of space [7]. Queries, such as edge label retrieval and insertion
or removal of a substring, have an $O(\log^2 \ell)$ slowdown. Russo et al. developed a
dynamic fully-compressed suffix tree requiring $\ell H_k(\ell) + o(\ell \log \sigma)$ bits of space, which is
asymptotically optimal under $k$th order empirical entropy [23]. This compressed suffix
tree representation uses a dynamic compressed suffix array and stores a sample of the
suffix tree nodes. Although some operations can be executed more quickly, all
operations have $O(\log^2 \ell)$ time complexity. This dynamic compressed suffix tree supports a
larger set of suffix tree navigation operations than the compressed suffix tree proposed
by Chan et al. [7]. It also reaches a better space complexity and can perform basic
operations more quickly. We hereafter suppose that a dynamic compressed suffix tree is
used to replace the dictionary of patterns and we refer to the slowdown of operations in
the entropy-compressed self-index as $\tau$.

We now describe a succinct version of Algorithm 1 that uses a dynamic compressed
suffix tree to represent and index the pattern rows in entropy-compressed space. Its
modifications are limited to steps 1a and 2 in Algorithm 1. Traversing the dynamic
compressed suffix tree introduces a $\tau$ slowdown in running time. During pattern
preprocessing, the dynamic compressed suffix tree is built incrementally, as each pattern row
is named. First, traversal of the suffix tree is attempted by traversing a path from the
root labeled by the characters in the pattern row. If a matching row is found, the new
row is given the same name as the row that it matches. Otherwise, the new pattern row
is inserted into the compressed suffix tree and given a new name.



The positions of a text block row are also named by traversing the suffix tree. Here
the suffix tree is not modified by the text. We use a technique similar to the one described
by Gusfield in the computation of matching statistics, [13] Section 7.8. Positions in a
text block are named, row by row, according to the names of pattern rows. To name a
new text block row, traversal begins at the root of the tree, with the edge whose label
matches the first position of the text block row. When $\overline{m}$ consecutive characters trace a
path from the root, traversal reaches a leaf, and the position is named with the matching
pattern row. At a mismatch, suffix links quickly find the longest suffix of the already
matched string that matches a prefix of some pattern row and the next text character is
compared to that labeled edge of the tree.

All pattern rows have width $\overline{m}$. This ensures that each text position can be uniquely
labeled. One pattern row cannot be a substring of another. Thus, we do not share the
concern of Amir and Farach's suffix tree based approach to dictionary matching [2].
They use lowest marked ancestor queries to address the issue of possibly skipping over
pattern occurrences in the case that one pattern is a substring of another and a suffix
link is traversed.

Theorem 1. If $d > \overline{m}$, we can solve the dynamic 2D dictionary matching problem in
almost linear $O((\ell + n_1 n_2)\tau)$ time and $O(\overline{m} \log \overline{m} + dm' \log dm')$ bits of extra space,
aside from the space used to represent the dictionary in a compressed self-index. Pattern
$P$ of size $p \times \overline{m}$ can be inserted to or removed from the dictionary in $O(p\overline{m}\tau)$ time and
the updated index will occupy an additional $O(p \log dm')$ bits of space, where $m'$ is
updated to reflect the new maximum pattern height.



5 2D-DDM in Small-Space for Small Number of Patterns 

This section deals with the case in which the number of patterns is smaller than the
common dimension among all dictionary patterns, i.e., $d = o(\overline{m})$. For this case, we do
not allow the space to label each text block location and therefore the dynamic version
of the Bird and Baker algorithm cannot be applied trivially. We present several
combinatorial tricks to preserve the spirit of Bird and Baker's algorithm without incurring the
necessary storage overhead. The dictionary is indexed by a dynamic compressed suffix
tree, after which the patterns can be discarded. This can be done in space that meets
$k$th order empirical entropy bounds of the input, as described in Section 4. Thus, the
compressed self-index does not occupy extra space. Throughout this section, the
extra space used by our algorithm is limited to $O(\overline{m} \log \overline{m} + dm' \log dm')$ bits of space.
The running time of our algorithm is almost linear, with a slowdown to accommodate
queries to the compressed suffix tree, referred to as $\tau$.

We divide the dictionary patterns into two groups and search the text for patterns
in each group separately. In the following sections, we describe first an algorithm for
patterns in which the rows are highly periodic and then an algorithm for all other
patterns. We begin by describing a dynamic data structure that is used by both parts of the
algorithm.



5.1 Dynamic Witness Tree 



In this section we show how to form a dynamic variant of the witness tree, a data
structure that was introduced in [21]. Given a set $S$ of $j$ strings, each of length $m$,
a witness tree can be constructed to name these strings in linear $O(jm)$ time and in
$O(j \log j)$ bits of space so that identical strings receive the same name [21]. An internal
node in the witness tree denotes a position of mismatch, which is an integer $\in [1, m]$.
Each edge of the tree is labeled with a single character. Sibling edges must have different
labels. A leaf represents a name given to string(s) in $S$.

Query: For any two strings $s, s' \in S$, return a position of mismatch between $s$ and $s'$
if $s \neq s'$, otherwise return $m + 1$.

Preprocessing the witness tree for Lowest Common Ancestor (LCA) queries on its 
leaves allows us to answer the above witness query between any two named strings in 
S in constant time. This preprocessing can be performed in linear time and space, with 
respect to the size of the tree, even for a dynamically changing tree [10]. 

Construction of the witness tree begins by choosing any two strings in S and com- 
paring them sequentially. When a mismatch is found, comparison halts and an internal 
node is created to represent this witness of mismatch, with two children to represent 
the names of the two strings. If no mismatch is found, the two strings are given the 
same name. Each successive string is compared to the witnesses stored in the tree by 
traversing a path from the root to identify to which name, if any, the string belongs. 
Characters of a new string are examined in the order dictated by traversal of the witness 
tree, possibly out of sequence. If traversal halts at an internal node, the string receives a 
new name, and a new leaf is added as a child to the internal node. Otherwise, traversal 
halts at a leaf, and the new string is compared sequentially to the string represented by 
the leaf, as done with the first two strings. 

Now we consider the scenario in which $S$ is a dynamically changing set of strings.

Lemma 1. A new string is added to the witness tree in $O(m)$ time.

Proof. Including a new string in $S$ and naming it with the witness tree follows the same
procedure that the static witness tree uses to build the witness tree as each pattern is
considered individually. This is done in $O(m)$ time and adds one or zero nodes to the
witness tree [21]. □

Lemma 2. A string is removed from the witness tree in $O(1)$ time.

Proof. In removing a string $s$ from $S$, there are two possibilities to consider. If $s$ is the
only string with its name, remove its leaf. In the event that the parent is an internal node
with only one other child, remove the hanging internal node as well. Then, the sibling
of the deleted leaf becomes a child of its grandparent. The other possibility is that some
other string(s) in $S$ bear the same name as $s$. We do not want to remove a leaf while
there is still a string in $S$ that has its name. Thus, we augment each leaf with an integer
field to store the number of strings in $S$ that have its name. This counter is increased
when a new string is named with an existing name. This counter is decreased when a
row is deleted. When the counter is down to 0, the leaf is discarded, possibly along with
its parent node, as described earlier. □



Observation 1. The dynamic witness tree of $j$ strings, each of length $m$, occupies $O(j \log j)$
bits of space.
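
The insertion and removal procedures of Lemmas 1 and 2 can be sketched as follows. This is an illustrative reading of the description above, not the implementation of [21]: positions are 0-based, a leaf object itself serves as the name, each leaf keeps a copy of a representative string, and the witness query uses a naive walk to the lowest common ancestor instead of constant-time LCA preprocessing. All class and method names are hypothetical.

```python
class Node:
    def __init__(self, parent=None, edge=None):
        self.parent, self.edge = parent, edge
        self.pos = None       # internal node: position of mismatch (0-based)
        self.children = {}    # internal node: edge character -> child
        self.rep = None       # leaf: representative string
        self.count = 0        # leaf: number of strings bearing this name


class DynamicWitnessTree:
    def __init__(self):
        self.root = None

    def _leaf(self, s, parent, edge):
        leaf = Node(parent, edge)
        leaf.rep, leaf.count = s, 1
        return leaf

    def _replace(self, old, new):
        # Hang `new` where `old` currently hangs.
        if old.parent is None:
            self.root = new
        else:
            old.parent.children[old.edge] = new

    def insert(self, s):
        """Name s and return its leaf."""
        if self.root is None:
            self.root = self._leaf(s, None, None)
            return self.root
        node = self.root
        while node.rep is None:                      # descend internal nodes
            child = node.children.get(s[node.pos])
            if child is None:                        # halted at an internal node
                leaf = self._leaf(s, node, s[node.pos])
                node.children[s[node.pos]] = leaf
                return leaf
            node = child
        # Halted at a leaf: compare sequentially with its representative.
        q = next((i for i in range(len(s)) if s[i] != node.rep[i]), None)
        if q is None:                                # identical: share the name
            node.count += 1
            return node
        inner = Node(node.parent, node.edge)         # new witness of mismatch
        inner.pos = q
        self._replace(node, inner)
        node.parent, node.edge = inner, node.rep[q]
        inner.children[node.rep[q]] = node
        leaf = self._leaf(s, inner, s[q])
        inner.children[s[q]] = leaf
        return leaf

    def remove(self, leaf):
        """Remove one string bearing the name `leaf`."""
        leaf.count -= 1
        if leaf.count > 0:
            return
        parent = leaf.parent
        if parent is None:
            self.root = None
            return
        del parent.children[leaf.edge]
        if len(parent.children) == 1:                # splice out hanging node
            (sibling,) = parent.children.values()
            sibling.parent, sibling.edge = parent.parent, parent.edge
            self._replace(parent, sibling)

    def witness(self, a, b):
        """A mismatch position between two names, or the length if equal."""
        if a is b:
            return len(a.rep)
        seen = set()
        while a is not None:
            seen.add(id(a))
            a = a.parent
        while id(b) not in seen:
            b = b.parent
        return b.pos
```

Because a new internal node is created only at a position where the two strings first differ, and both strings agree on every position already tested along the path, no position repeats on a root-to-leaf path, which keeps each insertion within $O(m)$ character inspections in this sketch.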

5.2 Group I Patterns 

A string $S$ is primitive if it cannot be expressed in the form $S = u^j$ for $j > 1$ and
any prefix $u$ of $S$. String $S$ is periodic in $u$ if $S = u^j u'$ where $u'$ is a prefix of $u$,
$u$ is primitive, and $j \ge 2$. A periodic string can be expressed as $u^j u'$ for one unique
primitive $u$. We refer to $u$ as "the period" of $S$. Depending on the context, we use the
term period to refer to either the string $u$ or the period size $|u|$.
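
These definitions can be checked mechanically. The sketch below uses the Knuth-Morris-Pratt failure function to obtain the smallest period of a string; this is a standard technique chosen for illustration and is not prescribed by the text. The function names are hypothetical.

```python
def failure(s):
    """KMP failure function: f[i] is the longest proper border of s[:i + 1]."""
    f = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k and s[i] != s[k]:
            k = f[k - 1]
        if s[i] == s[k]:
            k += 1
        f[i] = k
    return f


def smallest_period(s):
    """Size |u| of the shortest u such that s is a prefix of u repeated."""
    return len(s) - failure(s)[-1] if s else 0


def is_primitive(s):
    """True if s cannot be written as u^j with j > 1."""
    if not s:
        return False
    p = smallest_period(s)
    return p == len(s) or len(s) % p != 0


def is_periodic(s):
    """True if s = u^j u' with j >= 2, i.e. the period fits at least twice."""
    return bool(s) and 2 * smallest_period(s) <= len(s)
```

For example, "abcabcab" has period "abc" of size 3 and is periodic, whereas "aba" is primitive but not periodic.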

There are two types of patterns, and each one presents its own difficulty. In the
initial preprocessing step, we divide the patterns into two groups based on the 1D
periodicity of their rows. In Group I, all pattern rows are periodic, with periods $\le \overline{m}/4$.
The difficulty in this case is that many overlapping occurrences can appear in the text in
close proximity to each other, and we can easily have more candidates than the working
space we allow. Patterns in Group II have at least one aperiodic row or one row whose
period is larger than $\overline{m}/4$. Here, each pattern can occur only $O(1)$ times in a text block.
Since several patterns can overlap each other in both directions, a difficulty arises in the
text scanning stage. We do not allow the time to verify different candidates separately,
nor do we allow space to keep track of the possible overlaps between different patterns.

5.2.1 Preprocessing Dictionary 

For patterns in Group I, we linearize the patterns with Lyndon word naming [21] on
the rows. Two strings are conjugate if they differ only by a cyclic permutation of their
characters. A Lyndon word is a primitive string which is the smallest of its conjugates
for the alphabetic ordering. Lyndon word naming classifies strings by the conjugacy of
their periods and uses the Lyndon word as the class representative. Once Lyndon word
naming has been performed, each pattern row is represented by the name of its period's
class and its LWpos, the first position at which the Lyndon word begins in the row. We
use the dynamic witness tree to perform Lyndon word naming in linear time.
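
A small sketch of Lyndon word naming for a single periodic row follows. It is illustrative only: a Python dict stands in for the dynamic witness tree, the least rotation is found by direct comparison of all rotations (quadratic in the period size, whereas linear-time least-rotation algorithms exist), and positions are 0-based. The names `lyndon_name` and `names` are hypothetical.

```python
def smallest_period(s):
    """Smallest period size of s, via the KMP failure function."""
    f, k = [0] * len(s), 0
    for i in range(1, len(s)):
        while k and s[i] != s[k]:
            k = f[k - 1]
        if s[i] == s[k]:
            k += 1
        f[i] = k
    return len(s) - f[-1] if s else 0


def lyndon_name(row, names):
    """Return (name of the period's class, period size, LWpos) for a row.

    `names` maps each Lyndon word seen so far to an integer name; it is an
    assumed stand-in for naming with the dynamic witness tree.
    """
    p = smallest_period(row)
    u = row[:p]
    doubled = u + u
    # LWpos: offset of the smallest rotation of the period within the row.
    lwpos = min(range(p), key=lambda i: doubled[i:i + p])
    word = doubled[lwpos:lwpos + p]
    name = names.setdefault(word, len(names))
    return name, p, lwpos
```

Rows whose periods are conjugate, such as "abababab" and "babababa", receive the same name and differ only in LWpos.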

A pattern occurs in a text block if the 1D representations are the same and the
periods align within each row. The 2D Lyndon word is a succinct representation of
the Lyndon word that is conjugate to each row's period combined with the relative
alignments of the Lyndon words among the matrix rows. 2D Lyndon word naming
forms equivalence classes of patterns with the same 1D name and uses the 2D Lyndon
word in each class as the class representative. The 2D Lyndon word that represents an
$m_i \times \overline{m}$ matrix is computed in sublinear time and $O(m_i \log m_i)$ bits of working space
[20]. We classify the patterns with 2D Lyndon word naming so that the text scanning
stage can efficiently verify pattern occurrences.



The distance between any two overlapping occurrences of $P_j$ in the same row is the
Least Common Multiple (LCM) of the periods of all rows of $P_j$. We precompute the
LCM of each pattern so that $O(1)$ space suffices to store all occurrences of a pattern in
a row, and $O(dm' \log dm')$ bits of space suffice to store all pattern occurrences. The
LCM is computed incrementally, row by row. The LCM table stores the LCM of the
periods of the first $i$ rows of the pattern as LCM[$i$], for $1 \le i \le m_j$, and is available
during text scanning. Although the LCM can be exponential in $m'$, we only need the
elements of the LCM table that are polynomial in $m'$, as discussed in [20].
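
The incremental computation of the LCM table can be sketched as below. The optional `cap` parameter is a hypothetical device for discarding entries that grow beyond the range of interest; the precise bound used in [20] is not reproduced here. The list is 0-based, so `table[i - 1]` corresponds to LCM[$i$].

```python
from math import gcd


def lcm_table(periods, cap=None):
    """LCM of the periods of the first i rows, for each prefix of rows.

    Entries exceeding `cap` (an assumed threshold) are stored as None, as are
    all later entries, since the running LCM never decreases.
    """
    table, current = [], 1
    for p in periods:
        if current is not None:
            current = current * p // gcd(current, p)
            if cap is not None and current > cap:
                current = None
        table.append(current)
    return table
```

With the LCM at hand, the occurrences of a pattern within one row form an arithmetic progression, so they can be recorded by, say, the first position, the last position, and the LCM as the common difference.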

The following preprocessing steps are initially performed for each dictionary pattern
in Group I and are later used upon arrival of a new pattern.

1. For each pattern row,

(a) Compute period and canonize.

(b) Lyndon word naming with dynamic witness tree, resulting in 1D dictionary D'.

(c) Insert to dynamic compressed suffix tree.

2. Preprocess 1D dictionary:

(a) Preprocess D' for dynamic dictionary matching.

(b) Build LCM table for each 1D pattern.

(c) For each linearized pattern whose 1D form is not periodic or if $m' = O(\overline{m})$:
Compute 2D Lyndon word and the column $z$ it occurs in.

(d) For each linearized pattern whose 1D form is periodic when $\overline{m} = o(m')$:

i. Compute 2D Lyndon word and the column $z$ it occurs in for each p-block,
a period in the 1D pattern.

ii. Classify p-blocks by 2D Lyndon word naming.

iii. Compute the difference between $z$ in adjacent p-blocks.

iv. Build KMP automaton for named p-blocks and the differences between
their $z$ values.



Lemma 3. Patterns in Group I are preprocessed in $O(\ell\tau)$ time and
$O(\overline{m} \log \overline{m} + dm' \log dm')$ bits of extra space.

Proof. Step 1 processes a single pattern row in $O(\overline{m}\tau)$ time and $O(\overline{m} \log \overline{m})$ bits of
extra space [21]. Thus, the entire set of pattern rows is processed in $O(\ell)$ time to gather
information and $O(\ell\tau)$ time to index the pattern rows in a dynamic compressed suffix
tree. Since $O(1)$ information is stored per row, $O(dm' \log dm')$ bits of extra space are
used to store information gathered about the pattern rows in the dictionary.
Step 2 preprocesses the 1D patterns in the dictionary of names. Using Sahinalp and
Vishkin's algorithm, $O(dm')$ time and $O(dm' \log dm')$ bits of extra space are used to
facilitate linear time dynamic dictionary matching in a 1D dictionary of size $O(dm')$
[25]. The LCM tables of the 1D patterns are computed in linear time and occupy
$O(dm' \log m')$ bits of extra space. The 2D Lyndon word of each pattern is computed
in sublinear time with respect to its size and the set of 2D Lyndon words occupy
$O(dm' \log dm')$ bits of extra space [20]. Similarly, p-blocks are classified by their
representative 2D Lyndon words in sublinear time and a KMP automaton of the p-blocks is
constructed in $O(m')$ time [18]. Overall, Step 2 runs in $O(dm')$ time and $O(dm' \log dm')$
bits of extra space. □



Corollary 1. A new pattern of size $p \times \overline{m}$ is added to Group I in $O(p\overline{m}\tau)$ time and
$O(\overline{m} \log \overline{m} + p \log dm')$ bits of extra space.

Lemma 4. A pattern in Group I of size $p \times \overline{m}$ is removed from the dictionary in $O(p\overline{m}\tau)$
time, which frees the $O(\overline{m} \log \overline{m} + p \log dm')$ bits of extra space the algorithm allocated
for it.

Proof. The following steps meet the indicated time and space bounds and remove a
pattern from Group I. Each pattern row is removed from the dynamic witness tree, in $O(1)$
time (by Section 5.1), and from the dynamic compressed suffix tree, in $O(\overline{m}\tau)$ time.
This takes $O(p\overline{m}\tau)$ time in total. If this is the only pattern with its 1D representation,
its LCM table is deleted and the 1D pattern is removed from the dictionary of names
that has been preprocessed for dynamic dictionary matching. If this is one of several
patterns with the same 1D representation, and the sole member of its consistency class,
its representative 2D Lyndon word is removed from the compressed trie. □

5.2.2 Text Scanning

The text is searched for occurrences of patterns in Group I in a three step process. First,
Lyndon word naming is performed on the rows of the text block using the dynamic
witness tree of the dictionary (Section 5.1). For each pattern row, we store the name of
its period's class, its period size, LWpos, right, and left. Then, the linearized text, T', is
searched for candidate positions that match a pattern in the 1D dictionary using 1D
dynamic dictionary matching, since the patterns can be of varying heights. Finally, the
verification step finds the actual pattern occurrences among the candidates. Since the first
two steps have been described, the remainder of this section discusses the verification stage.
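The per-row information gathered by Lyndon word naming can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names `smallest_period` and `lyndon_name` are hypothetical, the least rotation is found by brute force rather than by the witness tree of Section 5.1, and the values right and left are omitted.

```python
def smallest_period(row: str) -> int:
    """Length of the shortest period of row, via the KMP failure function."""
    fail = [0] * len(row)
    k = 0
    for i in range(1, len(row)):
        while k and row[i] != row[k]:
            k = fail[k - 1]
        if row[i] == row[k]:
            k += 1
        fail[i] = k
    return len(row) - fail[-1] if row else 0


def lyndon_name(row: str):
    """Return (lyndon_word, period_size, lw_pos) for a row.

    lyndon_word is the lexicographically least rotation of the row's period.
    lw_pos is the first column of the row at which that rotation starts.
    Assumption: the row is long enough to contain lw_pos + period_size
    characters, which holds for rows with period at most m/4.
    """
    p = smallest_period(row)
    period = row[:p]
    doubled = period + period
    # Brute-force least rotation, O(p^2); a linear-time method such as
    # Booth's algorithm could be substituted here.
    lw_pos = min(range(p), key=lambda i: doubled[i:i + p])
    return doubled[lw_pos:lw_pos + p], p, lw_pos


# Two rows that are shifts of one another share a Lyndon word but differ in LWpos.
print(lyndon_name("bcabcabcab"))
print(lyndon_name("cabcabcabc"))
```

Rows that receive the same Lyndon word and period size belong to the same class; only LWpos distinguishes their horizontal alignment.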

To verify candidates, we consider the alignment of periods among rows and the
overall width of the 1D names in the text block. If m' = O(m), we can use a verification
procedure almost identical to the procedure that appears in [21]. However, if the uniform
width, m, is asymptotically smaller than the height of the tallest pattern, m', then this
algorithm does not yield linear time text scanning. This is because the
algorithm costs O(m') time to process each candidate row, resulting in O(m' · m')
time if m = o(m'). For this situation, new ideas are needed and we introduce a new
verification process that verifies a single pattern in O(m') time. Since the dictionary has
d patterns, and d < m, the entire text block is verified in O(mm') time.

We verify candidates for each pattern, Pi, separately. Verification of each candidate
consists of two tasks:

1. Verify shifts: Let Pi' be the 1D pattern of names for Pi. If Pi' is not periodic, there
are O(1) candidates in a text block, and we verify each candidate for Pi separately
by matching Pi's 2D Lyndon word with the 2D Lyndon word of the corresponding
rows of the text block. If Pi' is periodic, the idea is similar. We call each period
in Pi' a p-block. We first verify the shifts within each p-block and then verify the
shifts between adjacent p-blocks. We compute the 2D Lyndon word of each p-block
separately and store the column z that it occurs in. Since each p-block has the same
horizontal period (i.e., the LCM of the periods of the rows of a p-block), we use
a Knuth-Morris-Pratt automaton [18] on the p-blocks to complete the verification.
The KMP automaton verifies that corresponding p-blocks have the same name and
that the difference between the z values of adjacent p-blocks is the same in the text
and in the pattern.

2. Check width: Use range minimum and maximum queries to calculate minRight
and maxLeft for each candidate of Pi. Then, reverse the shift and make sure that
there is room for the pattern between minRight and maxLeft, i.e., that the candidate
spans at least m columns.
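The width check in task 2 can be sketched in Python as follows. This is a simplified illustration, not the paper's procedure: it assumes that `left[r]` and `right[r]` are the inclusive first and last columns of the periodic run in text row r, already adjusted for the shift, and it uses a sparse table with O(n log n) preprocessing in place of the linear-time scheme of [11]. The names `SparseTable` and `check_width` are hypothetical.

```python
class SparseTable:
    """Static range queries in O(1) time after O(n log n) preprocessing."""

    def __init__(self, values, op):
        self.op = op  # an idempotent operation such as min or max
        self.table = [list(values)]
        n = len(values)
        j = 1
        while (1 << j) <= n:
            prev = self.table[-1]
            half = 1 << (j - 1)
            self.table.append(
                [op(prev[i], prev[i + half]) for i in range(n - (1 << j) + 1)]
            )
            j += 1

    def query(self, lo, hi):
        """Apply op over values[lo:hi]; hi is exclusive and lo < hi."""
        j = (hi - lo).bit_length() - 1
        return self.op(self.table[j][lo], self.table[j][hi - (1 << j)])


def check_width(max_left, min_right, top, height, m):
    """True if the candidate occupying rows top..top+height-1 spans >= m columns."""
    lo = max_left.query(top, top + height)   # maxLeft over the candidate's rows
    hi = min_right.query(top, top + height)  # minRight over the candidate's rows
    return hi - lo + 1 >= m


left = [0, 2, 1, 0]
right = [9, 8, 9, 5]
max_left = SparseTable(left, max)
min_right = SparseTable(right, min)
print(check_width(max_left, min_right, top=0, height=3, m=7))
```

Each candidate costs two constant-time queries, which is the property the analysis of this step relies on.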

Lemma 5. A text of size n1 × n2 is searched for patterns in Group I in O(n1 n2 τ) time
and O(m log m + m' log dm') bits of extra space.

Proof. The linear representation of the text block is computed in O(mm') time and
occupies O(m' log dm') bits of space, as shown in Section 5.2.1. Candidates are identified
with Sahinalp and Vishkin's algorithm [25] in time linear in the 1D representations.
Verification as done in [21] is linear. It remains to show that the new verification, when
m = o(m'), runs in linear time. Computing the 2D Lyndon word for the entire text
block or for each of the p-blocks in the text block takes O(m') time. KMP on the
2D Lyndon words of the p-blocks and the shifts between p-blocks takes O(m') time.
Thus, Pi is verified in O(m + m') time, and all d patterns are verified in O(dm') time,
since d < m. Linear time and space preprocessing schemes allow us to answer range
minimum and maximum queries in O(1) time [11]. Check width (Step 2) consists of
a constant-time RMQ per candidate, which totals O(m') time overall for Pi, and for all
Pi, 1 ≤ i ≤ d, text scanning completes in O(mm') time.

Each block of text is searched in O(mm'τ) time and O(m log m + m' log dm') bits of
extra space. Thus, the entire text is searched for patterns in Group I in O(n1 n2 τ) time
and O(m log m + m' log dm') bits of extra space. □
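The KMP comparison of p-blocks used in the shift verification can be sketched in Python. The encoding below is an assumption made for illustration: each p-block is a pair (name, z), adjacent p-blocks are folded into one symbol (previous name, next name, difference of z values), and z values are compared as plain integers. The helper names `kmp_find_all`, `link_symbols`, and `match_p_blocks` are hypothetical.

```python
def kmp_find_all(pattern, text):
    """Start indices of all occurrences of pattern in text (sequences of hashables)."""
    if not pattern:
        return list(range(len(text) + 1))
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits, k = [], 0
    for i, sym in enumerate(text):
        while k and sym != pattern[k]:
            k = fail[k - 1]
        if sym == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


def link_symbols(blocks):
    """Encode adjacent p-blocks (name, z) as (name_prev, name_next, z_next - z_prev)."""
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(blocks, blocks[1:])]


def match_p_blocks(pattern_blocks, text_blocks):
    """Indices of text p-blocks at which the pattern's p-block sequence matches."""
    if len(pattern_blocks) == 1:
        # A single p-block has no adjacent shift to compare; match by name only.
        name = pattern_blocks[0][0]
        return [i for i, b in enumerate(text_blocks) if b[0] == name]
    return kmp_find_all(link_symbols(pattern_blocks), link_symbols(text_blocks))


pattern_blocks = [("A", 0), ("B", 2), ("A", 1)]
text_blocks = [("B", 5), ("A", 3), ("B", 5), ("A", 4), ("B", 6), ("A", 6)]
print(match_p_blocks(pattern_blocks, text_blocks))
```

Because absolute z values differ between text and pattern while their differences agree, only relative shifts enter the comparison, and a single left-to-right pass over the p-blocks suffices.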

5.3 Group II Patterns

Patterns in Group II have at least one aperiodic row or one row whose period is larger
than m/4. We assume that each pattern in this group has at least one aperiodic row. The
case of a pattern having a row that is periodic with period size between m/4 and m/2
is handled similarly, since each pattern can occur only O(1) times per text block row.
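The group definition above can be illustrated with a short Python sketch that computes each row's period and assigns a pattern to Group I or Group II. It is a minimal sketch under assumptions: the function names are hypothetical, the period is found with the KMP failure function, and the returned row index is simply the first row whose period exceeds m/4.

```python
def smallest_period(row: str) -> int:
    """Length of the shortest period of row, via the KMP failure function."""
    fail = [0] * len(row)
    k = 0
    for i in range(1, len(row)):
        while k and row[i] != row[k]:
            k = fail[k - 1]
        if row[i] == row[k]:
            k += 1
        fail[i] = k
    return len(row) - fail[-1] if row else 0


def classify(pattern_rows):
    """Return ("I", None) if every row has period <= m/4, else ("II", row_index).

    row_index is the first row that is aperiodic or has period larger than m/4.
    All rows are assumed to have the same width m.
    """
    m = len(pattern_rows[0])
    for index, row in enumerate(pattern_rows):
        if 4 * smallest_period(row) > m:
            return "II", index
    return "I", None


print(classify(["abababab", "aaaaaaaa"]))
print(classify(["abababab", "abcdabcd", "abcdefgh"]))
```

Each row is scanned once, so classifying a pattern of p rows takes time proportional to its size, in line with the O(m) per-row period computation cited in the analysis.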

For patterns in Group II, many different pattern rows can overlap in a text block
row. As a result, it is difficult to employ a succinct naming scheme to linearize the
text block and find all occurrences of patterns in the text. Instead, we use the aperiodic
row of each pattern to filter the text block and identify a limited set of candidates for
pattern occurrences. We use dynamic dueling [21] to eliminate inconsistent candidates
within each text column. Then, a single pass over the text suffices to verify all remaining
candidates for pattern occurrences.



5.3.1 Preprocessing Patterns 



The following preprocessing steps are initially performed for each dictionary pattern in 
Group II and are later used upon arrival of a new pattern. 

1. Locate first aperiodic row and preprocess for dynamic dictionary matching.

2. Name pattern rows using a single witness tree and store 1D patterns of names.

3. Insert pattern rows into dynamic compressed suffix tree.

4. Construct dynamic suffix tree of 1D patterns.

5. Preprocess witness tree and suffix tree for dynamic LCA.

Lemma 6. Patterns in Group II are preprocessed in O(ℓτ) time and O(dm log dm +
dm' log dm') bits of extra space.

Proof. In Step 1, the period of a pattern row is computed in O(m) time and O(m log m)
bits of extra space [19]. At most, all pattern rows are examined, in O(ℓ) time and
O(m log m) bits of extra space. Sahinalp and Vishkin's algorithm indexes these rows
in O(dm) time and O(dm log dm) bits of space [25]. Step 2 names pattern rows by the
witness tree in O(ℓ) time. By Section 5.1, the dynamic witness tree of pattern rows
occupies O(dm' log dm') bits of space. A single witness tree suffices since all pattern
rows are the same size. Step 3 indexes the pattern rows in a dynamic compressed suffix
tree in O(ℓτ) time. Step 4 constructs the dynamic suffix tree in O(dm') time and stores
it in O(dm' log dm') bits of space [9]. In Step 5, linear time preprocessing prepares the
dynamic suffix and witness trees for O(1) time LCA queries [10]. □

Corollary 2. The dictionary is updated to add a new pattern of size p × m to Group II
in O(pmτ) time and O(pm log dm + p log dm') bits of extra space.

Lemma 7. A pattern in Group II of size p × m is removed from the dictionary in
O(pmτ) time, which frees the O(m log m + p log dm') bits of extra space that the
algorithm allocated for it.

Proof. The following steps are performed to remove a pattern from Group II. The first
aperiodic row of the pattern is removed from the 1D dictionary that has been
preprocessed for dynamic dictionary matching, which takes O(m) time and deallocates
O(m log m) bits of space [25]. The 1D representation of the pattern is deleted and it is
removed from the suffix tree of 1D patterns, which takes O(p) time and deallocates
O(p log dm') bits of space [9]. Each row of the pattern is removed from the compressed
suffix tree, in O(pmτ) time overall. □

5.3.2 Text Scanning 

The text is searched for patterns in Group II in almost the same way as in the static
algorithm [21]. The only difference between the text scanning stage of the static
algorithm and that of the dynamic algorithm lies in the method used to identify 1D
pattern occurrences in the linearized text. The Aho-Corasick automaton is not suitable for
a dynamic dictionary since it is not updated efficiently. Rather, we use Sahinalp and
Vishkin's method for dynamic dictionary matching since it completes all preprocessing
and searching tasks, including updating the dictionary, in linear time and space. We
summarize the text scanning and the complexity analysis in the following.

Summary of Text Scanning 

1. Identify candidates: Sahinalp and Vishkin's 1D dynamic dictionary matching
algorithm finds occurrences of the first aperiodic row of the patterns. It searches the
text block, one row at a time.

2. Duel vertically: 

(a) An LCP query between suffixes of the 1D patterns finds the number of rows
that match in overlapping candidates. An LCA query in the suffix tree of 1D
patterns is performed to find a row of mismatch.

(b) We use an LCA query in the witness tree to find a witness of mismatch between 
rows of different names. Then a single character in each pattern row is retrieved 
and compared. 

3. Verify candidates: We verify one text block row at a time and mark positions at
which a pattern row (1D name) is expected to begin. Duels eliminate horizontally
inconsistent candidates. A duel consists of an LCP query in the dynamic compressed
suffix tree. After duels are performed, the surviving labels are carried to
the next row.
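The vertical duel of step 2 can be sketched in Python. This is a simplified illustration, not the data structures of the paper: the LCP query on the suffix tree of 1D patterns is replaced by a direct scan over rows, the witness from the witness tree is replaced by the first differing column, and patterns are stored as plain lists of equal-width strings. The function name `duel` and the candidate representation (top_row, pattern_rows) are hypothetical.

```python
def duel(text, col, cand_a, cand_b):
    """Duel two candidates that start in the same text column.

    Each candidate is (top_row, pattern_rows). Returns the list of survivors:
    both if they agree on their overlap, otherwise at most one.
    """
    (ra, A), (rb, B) = sorted([cand_a, cand_b], key=lambda c: c[0])
    shift = rb - ra
    overlap = min(len(A) - shift, len(B))
    if overlap <= 0:
        return [(ra, A), (rb, B)]  # no shared rows, nothing to duel over

    # Naive stand-in for the LCP query: count matching rows in the overlap.
    k = 0
    while k < overlap and A[shift + k] == B[k]:
        k += 1
    if k == overlap:
        return [(ra, A), (rb, B)]  # consistent candidates both survive

    # Naive stand-in for the witness: first column where the two rows differ.
    row_a, row_b = A[shift + k], B[k]
    w = next(j for j in range(len(row_a)) if row_a[j] != row_b[j])

    # One text character is compared with one character of each pattern row.
    ch = text[rb + k][col + w]
    survivors = []
    if ch == row_a[w]:
        survivors.append((ra, A))
    if ch == row_b[w]:
        survivors.append((rb, B))
    return survivors


text = ["xyz", "xyz", "xyw", "xyz"]
A = ["xyz", "xyz", "xyw"]
B = ["xyz", "xyz", "xyz"]
print(duel(text, 0, (0, A), (1, B)))
```

Since the two pattern rows differ at the witness column, the text character can agree with at most one of them, so every duel between inconsistent candidates eliminates at least one candidate.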

Lemma 8. A text of size n1 × n2 is searched for patterns in Group II in O(n1 n2 τ) time
and O(m log m + dm' log dm') bits of extra space.

Proof. Step 1 searches each text block row for a single row of each pattern in O(mm')
time and O(m log m) bits of extra space [25]. O(dm') candidates are stored in
O(dm' log dm') bits of extra space. In Step 2, each vertical duel consists of an O(1)-time
LCP query in the suffix tree, an O(1)-time LCA query in the witness tree, and an
O(τ)-time character retrieval and comparison in a pair of pattern rows. Overall, each duel
takes O(τ) time. Due to transitivity, the number of duels is limited by the number of
candidates. Since there are O(dm') candidate positions, and d < m, the vertical duels
complete in O(mm'τ) time. In Step 3, an LCP query in the dynamic compressed suffix tree
takes O(τ) time. By transitivity, the number of duels is limited by the number of
candidates, of which there are O(dm'). Since d < m, dueling is completed in O(mm'τ) time.
Verification uses space proportional to the labels for one text block row plus the number
of candidates, O(m log m + dm' log dm') bits. Each text character within an anticipated
pattern occurrence is only compared to one pattern character, in O(τ) time, which takes
O(mm'τ) time overall.

Each block of text is searched in O(mm'τ) time and O(m log m + dm' log dm') bits
of extra space. Thus, the entire text is searched for patterns in Group II in O(n1 n2 τ)
time and O(m log m + dm' log dm') bits of extra space. □

Theorem 2. Our algorithm for dynamic 2D dictionary matching when d < m completes
in O((ℓ + n1 n2)τ) time and O(dm log dm + dm' log dm') bits of extra space.
A pattern P of size p × m can be inserted into or removed from the dictionary in O(pmτ)
time and the index will occupy an additional O(p log dm') bits of space, where m' is
updated to reflect the new maximum pattern height.

Proof. We separate the patterns into two groups and search for patterns in each group
separately. Classifying a pattern entails finding the period of each pattern row. This is
done in O(m) time and O(m log m) bits of extra space per row [19]. Overall, the
dictionary is separated into two groups in O(ℓ) time and O(m log m) bits of extra space.
For patterns in Group I, this complexity is demonstrated by Lemmas 3, 4, 5 and
Corollary 1. For patterns in Group II, this complexity is demonstrated by Lemmas 6, 7, 8
and Corollary 2. □

6 Conclusion 

We have presented the first efficient dynamic 2D dictionary matching algorithm that
runs in sublinear working space. The algorithm is a succinct and dynamic version of the
classic Bird / Baker algorithm. Since we follow their labeling paradigm, our algorithm
is suited for a dictionary of rectangular patterns that are all the same size in at least one
dimension. Our algorithm uses a dynamic compressed suffix tree as a compressed
self-index to represent the dictionary in entropy-compressed space. All tasks are completed
by our algorithm in linear time, overlooking the slowdown in querying the compressed
suffix tree.

When the rectangular patterns are of different heights, widths and aspect ratios, a
method that labels text positions is not appropriate. Idury and Schaffer developed a
dynamic dictionary matching algorithm for such patterns [17]. Their algorithm uses
techniques for multidimensional range searching as well as several applications of the
Bird / Baker algorithm, after splitting each pattern into overlapping pieces and handling
these segments in groups of uniform height. Idury and Schaffer's algorithm requires
working space proportional to the dictionary size. We hope that our succinct dynamic
version of the Bird / Baker algorithm is a first step towards addressing the more general
problem of succinct dynamic 2D dictionary matching among all rectangular patterns.

Many problems related to succinct dynamic dictionary matching remain open. In
future work we hope to address succinct 2D dictionary matching when the pattern
occurrences can be approximately matched to the text. The approximate matches may
accommodate character mismatches, insertions, deletions, "don't care" characters, or
swaps.